We present an extension of two policy-iteration based algorithms on weighted graphs (viz., Markov
Decision Processes and Max-Plus Algebras). This extension allows us to solve the following inverse
problem: considering the weights of the graph to be unknown constants or parameters, we suppose
that a reference instantiation of those weights is given, and we aim at computing a constraint on the
parameters under which an optimal policy for the reference instantiation is still optimal. The original
algorithm is thus guaranteed to behave well around the reference instantiation, which provides us
with some criteria of robustness. We apply both methods to simple examples. A prototype
implementation has been developed.

1 Introduction 

We consider the inverse problem initially defined in the context of timed models. More precisely, this 
inverse problem was first formalized in the context of Timing Constraint Graphs [7], and then in the 
context of Timed Automata [1, 3]. We present here this problem in the context of systems modeled by 
directed graphs with (parametric) weights associated to their edges, and more specifically in the cases of 
Markov Decision Processes (MDPs) [4, 8] and Max-Plus Algebras [5]. 

Let us first present the direct problem in this context. The model is given under the form of a directed
graph G, with weights that are unknown constants or parameters. We also assume that a reference
instantiation $\pi_0$ is given for these parametric weights. Roughly speaking, a policy is a function which
associates with each state of the graph an action which goes from the state to (a set of) successor state(s).
Each action has a specific weight. The weight of a path (or sequence of actions) is the sum of the weights
of its constitutive actions. The value (or cost) of a given policy $\mu$ for a given state s corresponds to the
mean weight of the paths induced by $\mu$, which go from s to a final state of the graph. Given a specific
instantiation $\pi_0$ of the parameters, the direct problem consists in computing an optimal policy, that is a
policy which gives the minimal value (or maximal value) when the parameters are instantiated with $\pi_0$.

The optimal policy is classically found using the method of policy iteration (PI) (see [8]). The 
corresponding value is then computed by the value determination procedure (VD) (see, e.g., [5]). We 
show in this paper that the inverse problem can be simply stated, and solved via a natural generalization 
of the procedures of policy iteration and value determination. We focus here on two classes of models: 
Markov Decision Processes and Max-Plus Algebras. 

Given a reference valuation $\pi_0$, the inverse algorithm generalizes the direct algorithm "around" $\pi_0$,
and infers a constraint on the parameters guaranteeing a similar behavior as under $\pi_0$. This ensures that
the original algorithm continues to behave well around $\pi_0$, thus giving some criteria of robustness.

Figure 1: Our generic framework



We first give the general framework of our method (Sect. 2). We then present the adaptation of the 
inverse method to Markov Decision Processes (Sect. 3) and Max-Plus Algebras (Sect. 4). We conclude 
by giving some final remarks (Sect. 5). 

2 General Framework 

2.1 Preliminaries 

Throughout this paper, we assume a fixed set $P = \{p_1, \dots, p_N\}$ of parameters. A parameter
instantiation $\pi$ is a function $\pi : P \to \mathbb{R}$ assigning a real constant to each parameter. There is a
one-to-one correspondence between instantiations and points in $\mathbb{R}^N$. We will often identify an
instantiation $\pi$ with the point $(\pi(p_1), \dots, \pi(p_N))$.

Definition 1 A linear inequality on the parameters P is an inequality $e \prec e'$, where $\prec \in \{<, \le\}$,
and $e, e'$ are two linear terms of the form

$\sum_i \alpha_i p_i + d$

where $1 \le i \le N$, $\alpha_i \in \mathbb{R}$ and $d \in \mathbb{R}$.

A (convex) constraint on the parameters P is a conjunction of inequalities on P. 

We say that a parameter instantiation $\pi$ satisfies a constraint K on the parameters, denoted by $\pi \models K$,
if the expression obtained by replacing each parameter p in K with $\pi(p)$ evaluates to true. We will
consider True as a constraint on the parameters, corresponding to the set of all possible instantiations of P.
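
The satisfaction relation above can be illustrated with a small sketch. The encoding used here (a linear term as a pair of a coefficient dictionary and a constant, an inequality as a triple, and the names `evaluate` and `satisfies`) is an assumption made for illustration and does not come from the paper; the constraint K is a made-up example.

```python
from fractions import Fraction

# Hypothetical encoding: a linear term sum_i alpha_i * p_i + d is the pair
# (coeffs, d), where coeffs maps a parameter name to its coefficient alpha_i.
def evaluate(term, pi):
    """Value of a linear term under the instantiation pi (a dict name -> number)."""
    coeffs, d = term
    return sum(Fraction(a) * pi[p] for p, a in coeffs.items()) + d

# Hypothetical encoding: an inequality is (lhs, strict, rhs), meaning
# lhs < rhs when strict is True and lhs <= rhs otherwise.
def satisfies(pi, constraint):
    """Return True iff pi satisfies every inequality of the conjunction."""
    for lhs, strict, rhs in constraint:
        left, right = evaluate(lhs, pi), evaluate(rhs, pi)
        if not (left < right if strict else left <= right):
            return False
    return True  # the empty conjunction plays the role of the constraint True

# Made-up constraint K:  p1 + 2*p2 <= 10  and  0 < p1
K = [
    (({"p1": 1, "p2": 2}, 0), False, ({}, 10)),
    (({}, 0), True, ({"p1": 1}, 0)),
]

print(satisfies({"p1": 2, "p2": 4}, K))
print(satisfies({"p1": 0, "p2": 1}, K))
```

Exact rational arithmetic is used so that boundary cases of non-strict inequalities are not affected by rounding.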

2.2 Overview of the Inverse Method 

We assume given a weighted graph, and an algorithm PI of policy iteration. We define a parametric
version of the weighted graph, i.e., a weighted graph whose weights are unknown constants, or parameters.
Given a parametric weighted graph G and an instantiation $\pi$ of the parameters, we denote by $G[\pi]$ the
(standard) weighted graph, where the parameters $p_i$ have been replaced by their instance $\pi(p_i)$. For a
given graph G, a given reference instance $\pi_0$ of the parameters, and an optimal policy $\mu_0$ found by PI for
$G[\pi_0]$, our goal is to generate a constraint $K_0$ on the parameters such that:

1. $\pi_0 \models K_0$, and

2. $\mu_0$ is optimal for $G[\pi]$, for any instantiation $\pi$ satisfying $K_0$ (i.e., $\pi \models K_0$).

A trivial solution is $K_0 = \{\pi_0\}$. However, our method will always generate something more general
than $K_0 = \{\pi_0\}$, under the form of a conjunction of inequalities on the parameters (without any constant,
apart from 0). Given PI, the framework of our inverse method is given in Fig. 1. Given an algorithm PI
of policy iteration from the literature, which itself calls an algorithm VD of value determination, our
approach can be summarized as follows:

1. Compute an optimal policy $\mu_0$ for the (standard) weighted graph $G[\pi_0]$, using PI;

2. Compute a generic value (or generic cost) corresponding to G for the policy $\mu_0$, using a
parameterized version of VD;

3. From the generic value computed above, infer a constraint $K_0$ such that $\mu_0$ is optimal for $G[\pi]$, for
any instantiation $\pi$ satisfying $K_0$.

We now present such an inverse method in the case of two policy-based iteration algorithms. 

3 Markov Decision Processes 
3.1 Preliminaries 

We consider in this section Markov Decision Processes [4] as an extension of weighted labeled directed 
graphs. We associate to every edge of the graph a probability such that, for a given state and a given 
action (or label), the sum of the probabilities of the edges leaving this state through this action is equal 
to 1. Markov Decision Processes are widely used to model, e.g., the power consumption of devices (see, 
e.g., [10]). Formally: 

Definition 2 A Markov Decision Process (MDP) is a tuple M = (S, A, Prob, w), where

• $S = \{s_1, \dots, s_n\}$ is a set of states;

• A is a set of actions (or labels);

• $Prob : S \times A \times S \to [0, 1]$ is a probability function such that $Prob(s_1, a, s_2)$ is the probability that
action a in state $s_1$ will lead to state $s_2$, and $\forall s \in S, \forall a \in A : \sum_{s' \in S} Prob(s, a, s') = 1$;

• $w : S \times A \to \mathbb{R}$ is a weight function such that $w(s, a)$ (also denoted by $w_a(s)$) is the weight associated
to the action a when leaving s.

In the following, we consider the MDP M = (S, A, Prob, w). Given a state $s \in S$, we denote by e(s)
(for enabled) the set of possible actions for s, i.e., $\{a \in A \mid \exists s' \in S : Prob(s, a, s') > 0\}$. We suppose that,
for any state $s \in S$, $e(s) \neq \emptyset$. We also suppose that M has a unique "absorbing state", i.e., a state which
is reachable (with positive probability) from any other state for any policy, and which has a self-loop
outgoing transition with weight 0 and probability 1. We suppose in the following that the absorbing state
is $s_n$. For the sake of simplicity, we will not depict, in the graphs describing MDPs in this paper, the
self-loop outgoing transition of the absorbing state.

In every state s of $S \setminus \{s_n\}$, we can choose non-deterministically an action a in e(s). Then, for
this action, the system will evolve to a state s' such that Prob(s, a, s') > 0. A way of removing
non-determinism from an MDP is to introduce a policy $\mu$, i.e., a function from states to actions. A policy
is of the form $\mu = \{s_1 \to a_{i_1}, s_2 \to a_{i_2}, \dots, s_{n-1} \to a_{i_{n-1}}\}$, with $a_{i_1}, \dots, a_{i_{n-1}} \in A$. We denote by $\mu[s]$ the
action associated to state s. The MDP, associated to a policy, behaves as a Markov chain [9].

Given a policy $\mu$, the associated value is a function mapping each state s to the mean sum of weights
attached to the paths induced by $\mu$, which go from s to $s_n$. (By convention, the value associated to $s_n$
is null.) A classical problem for MDPs is to find an optimal policy, i.e., a policy under which the value
function is maximum (or minimum), for every $s \in S$. Note that, under the assumption of the existence of
an absorbing state, such an optimal policy always exists, but is not necessarily unique (see, e.g., [8]). We
focus here on finding an optimal policy for which the value function is minimal.

We give in Fig. 12 in Appendix A the classical algorithm mdpPI for policy iteration on MDPs. This
algorithm computes the optimal policy for an MDP, and it makes use of the algorithm mdpVD for value
determination in MDPs (see Fig. 11 in Appendix A). Given an MDP and a policy, this second algorithm
computes the mean sum of weights attached to the paths reaching $s_n$, for every starting state in $S \setminus \{s_n\}$.
We denote by v[s] the value associated to state s. The value v computed by Algorithm mdpVD is obtained
by solving a system of linear equations, and is computed by applying the inverse of a real-valued matrix
to a parametric vector. The fact that there is a single solution to this system is due to the fact that the
matrix is invertible, which comes itself from the existence of an absorbing state.

Figure 2: An example of Markov Decision Process

3.2 An Illustrating Example 

Consider the case of a researcher traveling by train from Paris to Bologna. He can either take a night train
Corail, or use the French high-speed train TGV. When there is no strike impacting the TGV service, the
TGV usually needs 7 hours to go from Paris to Milan (with probability 4/5). It is then possible to take
an Italian train, reaching Bologna from Milan in 1 hour with probability 1. However, in case of strike
(with probability 1/5), the TGV does not leave Paris, and the researcher should wait 7 more hours until
the next TGV. Note that this next TGV may also be on strike (with the same probability 1/5), and so on.
The night train cannot be impacted by any strike, and it goes directly from Paris to Bologna in 11 hours
with probability 1.

The MDP depicted in Fig. 2 summarizes those different possibilities, where P stands for Paris, M for
Milan and B for Bologna. We denote by "TGV (4/5) 7" a transition using label TGV with probability
4/5 and weight 7 (i.e., 7 hours). Note that the only source of non-determinism is in state P, where it is
possible to choose between the TGV and the Corail actions.

We are first interested in the following question: considering the probability of strike, what is the best
option, i.e., should we use the TGV or the night train? This problem corresponds to finding an optimal
policy for this MDP, i.e., a policy minimizing the global weight of the system w.r.t. the probabilities. An
application of the (standard) algorithm mdpPI [8] (see Fig. 12 in Appendix A) to the MDP modeling the
train journey from Paris to Bologna gives the following optimal policy: $\mu = \{P \to TGV, M \to Train\}$^1.
For this policy, the value for state P (given by the last call to Algorithm mdpVD), i.e., the expected time
to reach Bologna, is 9.75.
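
As an illustration of the direct problem on this example, here is a minimal sketch of textbook policy iteration with exact rational arithmetic. It is not a transcription of mdpPI or mdpVD from Appendix A; the dictionary encoding and the function names are assumptions made for this sketch, while the probabilities and durations are those of the train example.

```python
from fractions import Fraction as F

# Train example: prob[s][a] = {successor: probability}, weight[s][a] = hours.
prob = {
    "P": {"TGV": {"P": F(1, 5), "M": F(4, 5)}, "Corail": {"B": F(1)}},
    "M": {"Train": {"B": F(1)}},
}
weight = {"P": {"TGV": 7, "Corail": 11}, "M": {"Train": 1}}
ABSORBING = "B"

def value_determination(policy):
    """Solve v[s] = w(s, mu[s]) + sum_s' Prob(s, mu[s], s') * v[s'] exactly."""
    states = sorted(policy)
    n = len(states)
    # Augmented matrix of the system (1 - A) v = w, one row per non-absorbing state.
    rows = []
    for s in states:
        a = policy[s]
        row = [(F(1) if t == s else F(0)) - prob[s][a].get(t, F(0)) for t in states]
        rows.append(row + [F(weight[s][a])])
    for i in range(n):  # Gauss-Jordan elimination
        piv = next(r for r in range(i, n) if rows[r][i] != 0)
        rows[i], rows[piv] = rows[piv], rows[i]
        rows[i] = [c / rows[i][i] for c in rows[i]]
        for r in range(n):
            factor = rows[r][i]
            if r != i and factor != 0:
                rows[r] = [cr - factor * ci for cr, ci in zip(rows[r], rows[i])]
    v = {s: rows[k][n] for k, s in enumerate(states)}
    v[ABSORBING] = F(0)  # by convention, the value of the absorbing state is null
    return v

def q_value(s, a, v):
    """Expected cost of playing action a in s, then following the current values."""
    return weight[s][a] + sum(p * v[t] for t, p in prob[s][a].items())

def policy_iteration(policy):
    """Switch to a strictly better action until no state can be improved."""
    while True:
        v = value_determination(policy)
        new_policy = dict(policy)
        for s in policy:
            best = min(prob[s], key=lambda a: q_value(s, a, v))
            if q_value(s, best, v) < q_value(s, policy[s], v):
                new_policy[s] = best
        if new_policy == policy:
            return policy, v
        policy = new_policy

mu, v = policy_iteration({"P": "Corail", "M": "Train"})
print(mu["P"], float(v["P"]))  # TGV 9.75
```

Starting from the night-train policy, one improvement step switches state P to the TGV action, after which no action is strictly better.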

We now suppose that the train between Milan and Bologna can be subject to delays due, e.g., to
some works on the track. Our problem is the following: up to which delay of the train between Milan and
Bologna does the option "TGV" in Paris remain the best option? In other words, up to which delay of the
train between Milan and Bologna does the optimal policy remain optimal? We are thus interested in
computing a constraint on all the delays of the system, viewed as parameters, such that, for any
instantiation satisfying this constraint, the policy $\mu$ remains the optimal policy for this MDP.

1 As B is the absorbing state, recall that we do not define a policy for it.

3.3 The Algorithm P-mdpPI 

We first adapt the notion of MDP to the parametric case. We now consider that the weights of the MDP 
are parameters. 

Definition 3 Given a set P of parameters, a Parametric Markov Decision Process (PMDP) is a tuple
M = (S, A, Prob, W), where

• $S = \{s_1, \dots, s_n\}$ is a set of states;

• A is a set of actions;

• $Prob : S \times A \times S \to [0, 1]$ is a probability function such that $Prob(s_1, a, s_2)$ is the probability that
action a in state $s_1$ will lead to state $s_2$, and $\forall s \in S, \forall a \in A : \sum_{s' \in S} Prob(s, a, s') = 1$;

• $W : S \times A \to P$ is a parametric weight function such that $W(s, a)$ (also denoted by $W_a(s)$) is a
parameter associated to the action a when leaving s.

We consider in the following the PMDP M = (S, A, Prob, W). Given an instantiation $\pi$ of the
parameters, we denote by $W[\pi]$ the function from $S \times A$ to $\mathbb{R}$ obtained by replacing each occurrence of a
parameter $p_i$ in W with the value $\pi(p_i)$, for $1 \le i \le N$. By extension, we denote by $M[\pi]$ the (standard)
MDP $(S, A, Prob, W[\pi])$.

We first introduce the algorithm P-mdpVD, given in Fig. 3, which computes, given a policy $\mu$, the
parametric value associated to every state s (i.e., the mean sum of the parametric weights of paths induced
by $\mu$ going from s to $s_n$). This algorithm is a straightforward adaptation to the parametric case of the
classical algorithm mdpVD of value determination for MDPs (see Fig. 11 in Appendix A). We denote by
V[s] the parametric value associated to state s.

ALGORITHM P-mdpVD(M, $\mu$)

Input    M : Parametric Markov Decision Process (S, A, Prob, W)
         $\mu$ : Policy
Output   V : Parametric value function

SOLVE $\{V[s] = W_{\mu[s]}(s) + \sum_{s' \in S} Prob(s, \mu[s], s') \times V[s']\}_{s \in S \setminus \{s_n\}}$

Figure 3: Algorithm for parametric value determination for MDPs

The value V computed by this algorithm P-mdpVD is obtained by solving a system of linear
equations. Since this system is of the form $V = A \times V + B$, it is equivalent to $V = (\mathbf{1} - A)^{-1} \times B$, where
$\mathbf{1}$ denotes the identity matrix, and can be implemented using the inversion of matrix $(\mathbf{1} - A)$. Note that
this matrix A is computed from matrix Prob and vector $\mu$, and is therefore a constant real-valued matrix
(i.e., containing no parameters). As for the algorithm mdpVD, the fact that there is a single solution to
this system comes from the existence of an absorbing state. Note also that the parametric value
associated to a state is a linear term, as defined in Def. 1.
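
The following sketch illustrates how such a parametric system can be solved: since the matrix is real-valued, ordinary Gauss-Jordan elimination applies, with the right-hand side carried as linear terms. The representation (a linear term as a dictionary from parameter names to coefficients) and the names `p_mdp_vd`, `scale` and `add` are assumptions made for illustration; this is not the ImpRator implementation. The data are those of the train example, with parameters p1, p2, p3 standing for the three durations.

```python
from fractions import Fraction as F

def scale(term, c):
    """Multiply a linear term {parameter: coefficient} by the constant c."""
    return {p: c * a for p, a in term.items()}

def add(t1, t2):
    """Sum of two linear terms, dropping zero coefficients."""
    out = dict(t1)
    for p, a in t2.items():
        out[p] = out.get(p, F(0)) + a
    return {p: a for p, a in out.items() if a != 0}

def p_mdp_vd(states, prob, W, policy):
    """Solve (1 - A) V = B, where A is real-valued and B holds linear terms.

    `states` lists the non-absorbing states; the absorbing state has value 0
    and therefore does not appear in the system.
    """
    n = len(states)
    A = [[(F(1) if t == s else F(0)) - prob[s][policy[s]].get(t, F(0))
          for t in states] for s in states]
    B = [{W[s][policy[s]]: F(1)} for s in states]
    for i in range(n):  # Gauss-Jordan elimination, mirrored on B
        piv = next(r for r in range(i, n) if A[r][i] != 0)
        A[i], A[piv] = A[piv], A[i]
        B[i], B[piv] = B[piv], B[i]
        inv = 1 / A[i][i]
        A[i] = [c * inv for c in A[i]]
        B[i] = scale(B[i], inv)
        for r in range(n):
            factor = A[r][i]
            if r != i and factor != 0:
                A[r] = [cr - factor * ci for cr, ci in zip(A[r], A[i])]
                B[r] = add(B[r], scale(B[i], -factor))
    return {s: B[k] for k, s in enumerate(states)}

# Train example with parametric durations p1 (TGV), p2 (Corail), p3 (Train).
prob = {
    "P": {"TGV": {"P": F(1, 5), "M": F(4, 5)}, "Corail": {"B": F(1)}},
    "M": {"Train": {"B": F(1)}},
}
W = {"P": {"TGV": "p1", "Corail": "p2"}, "M": {"Train": "p3"}}
V = p_mdp_vd(["P", "M"], prob, W, {"P": "TGV", "M": "Train"})
print(V["P"])  # linear term for V[P]: 5/4 * p1 + p3
```

Instantiating the resulting term with p1 = 7 and p3 = 1 gives 5/4 x 7 + 1 = 9.75, the value obtained in the non-parametric case.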

We state in the following Lemma that, given M and $\mu$, the instantiation with $\pi$ of the parametric
value associated to M w.r.t. $\mu$ is equal to the value associated to $M[\pi]$ w.r.t. $\mu$. We use $V[\pi]$ to denote
the parametric value V instantiated with $\pi$.
ALGORITHM P-mdpPI(M, $\mu_0$)

Input    M : Parametric Markov Decision Process (S, A, Prob, W)
         $\mu_0$ : Optimal policy for the reference instantiation of the parameters
Output   $K_0$ : Constraint on the set of parameters

V := P-mdpVD(M, $\mu_0$)
$K_0$ := True
FOR EACH $s \in S \setminus \{s_n\}$ DO
    FOR EACH $a \in e(s)$ s.t. $a \neq \mu_0[s]$ DO
        $K_0 := K_0 \wedge \{W_a(s) + \sum_{s' \in S} Prob(s, a, s') V[s'] \ge V[s]\}$
    OD
OD

Figure 4: Algorithm solving the inverse problem for MDPs

Lemma 1 Let M = (S, A, Prob, W) be a PMDP, $\pi$ an instantiation of the parameters, and $\mu$ a policy for
M. Let V = P-mdpVD(M, $\mu$). Then $V[\pi]$ = mdpVD($M[\pi]$, $\mu$).

Proof. The algorithm P-mdpVD(M, $\mu$) consists in solving a system of the form $V = A \times V + W_\mu$.
Hence, $V = (\mathbf{1} - A)^{-1} \times W_\mu$. Moreover, the algorithm mdpVD($M[\pi]$, $\mu$) consists in solving a system of
the form $v = A' \times v + W[\pi]_\mu$, i.e., $v = (\mathbf{1} - A')^{-1} \times W[\pi]_\mu$. It is easy to see on the two algorithms that
$A = A'$. We trivially have: for all s, $W[\pi]_{\mu[s]}(s) = (W_{\mu[s]}(s))[\pi]$, where $(W_{\mu[s]}(s))[\pi]$ denotes the linear
term $W_{\mu[s]}(s)$ where every occurrence of a parameter $p_i$ was replaced by its instantiation $\pi(p_i)$. Hence,
$V[\pi]$ = mdpVD($M[\pi]$, $\mu$). □

We now introduce the algorithm P-mdpPI, which fits in our general framework of Fig. 1. Given a
reference instantiation $\pi_0$ of the parameters, this algorithm takes as input a PMDP M, and an optimal
policy $\mu_0$ associated to $M[\pi_0]$ (which can be computed using mdpPI($M[\pi_0]$)). Recall that, by "optimal",
we mean here a policy under which the value of states is minimal. The algorithm outputs a constraint $K_0$
on the parameters such that:

1. $\pi_0 \models K_0$, and

2. for any $\pi \models K_0$, $\mu_0$ is an optimal policy of $M[\pi]$.

The algorithm P-mdpPI is given in Fig. 4. We can summarize this algorithm as follows:

1. Compute the parametric value function, which associates to any state a parametric value w.r.t. $\mu_0$,
using Algorithm P-mdpVD;

2. For every state $s \neq s_n$, for every action a different from the action $\mu_0[s]$ given by the optimal
policy, generate the following inequality stating that a is not a better action (i.e., an action which
would lead to a better policy) than $\mu_0[s]$:

$W_a(s) + \sum_{s' \in S} Prob(s, a, s') V[s'] \ge V[s]$

The above set of inequalities implies that, for any s and a, the policy obtained from $\mu_0$ by replacing $\mu_0[s]$
with a does not improve policy $\mu_0$ (i.e., does not lead to any smaller value of state).
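
The inequality generation of step 2 can be sketched as follows. This is an illustrative sketch, not the ImpRator code: the function name `p_mdp_pi`, the dictionary encoding, and the convention that each returned linear term t stands for the inequality t >= 0 are assumptions. The parametric values V are taken as given for the train example (V[P] = 5/4 p1 + p3, V[M] = p3, V[B] = 0).

```python
from fractions import Fraction as F

def p_mdp_pi(states, prob, W, policy, V):
    """Return a list of linear terms t, each standing for the inequality t >= 0,
    with t = W_a(s) + sum_s' Prob(s, a, s') * V[s'] - V[s],
    for every non-absorbing state s and every enabled action a != policy[s]."""
    K = []
    for s in states:
        for a in prob[s]:
            if a == policy[s]:
                continue
            term = {W[s][a]: F(1)}
            for t, p in prob[s][a].items():          # + sum Prob * V[s']
                for param, c in V.get(t, {}).items():
                    term[param] = term.get(param, F(0)) + p * c
            for param, c in V[s].items():            # - V[s]
                term[param] = term.get(param, F(0)) - c
            K.append({q: c for q, c in term.items() if c != 0})
    return K

# Train example: transition probabilities and parametric weights.
prob = {
    "P": {"TGV": {"P": F(1, 5), "M": F(4, 5)}, "Corail": {"B": F(1)}},
    "M": {"Train": {"B": F(1)}},
}
W = {"P": {"TGV": "p1", "Corail": "p2"}, "M": {"Train": "p3"}}
mu0 = {"P": "TGV", "M": "Train"}
# Parametric values w.r.t. mu0, assumed already computed by value determination.
V = {"P": {"p1": F(5, 4), "p3": F(1)}, "M": {"p3": F(1)}, "B": {}}

K0 = p_mdp_pi(["P", "M"], prob, W, mu0, V)
print(K0)  # one term: p2 - 5/4 * p1 - p3, to be read as ">= 0"
```

Only state P offers an alternative action, so a single inequality is produced, which matches the bound of at most one inequality per pair of a state and an action.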
3.4 Properties

We first show that $\pi_0$ models the constraint $K_0$ output by our algorithm.

Proposition 1 Let $\mu_0$ = mdpPI($M[\pi_0]$), and $K_0$ = P-mdpPI(M, $\mu_0$). Then $\pi_0 \models K_0$.

Proof. (By reductio ad absurdum) Suppose $\pi_0 \not\models K_0$. Then, there exists an inequality J in $K_0$ such
that $\pi_0 \not\models J$. By construction, this inequality J is of the form $W_a(s) + \sum_{s' \in S} Prob(s, a, s') V[s'] \ge V[s]$, for
some s and some a. If this inequality J is not satisfied by $\pi_0$, this means that a is a strictly better policy
for s than the policy $\mu_0[s]$ in $M[\pi_0]$, which is not possible since $\mu_0$ is an optimal policy for $M[\pi_0]$. □

Proposition 2 Algorithm P-mdpPI terminates.

Proof. Since M contains exactly one absorbing state, the computation of the parametric value in
P-mdpVD is guaranteed to terminate with a single solution. Since the number of generated inequalities
is finite, it is easy to see that Algorithm P-mdpPI terminates. □

Note that the size (in terms of number of inequalities) of the constraint $K_0$ output by our algorithm is
in $O(|S| \times |A|)$, where $|S|$ (resp. $|A|$) denotes the number of states (resp. actions) of M.

We now state that our algorithm P-mdpPI solves the inverse problem as described in Sect. 2.2. 

Theorem 1 Let $\mu_0$ = mdpPI($M[\pi_0]$), and $K_0$ = P-mdpPI(M, $\mu_0$). Then:

1. $\pi_0 \models K_0$, and

2. for all $\pi \models K_0$, policy $\mu_0$ is optimal for $M[\pi]$.

Proof. Let us prove item (2) by reductio ad absurdum. Recall that M = (S, A, Prob, W). Let $\pi \models K_0$.
We have $M[\pi] = (S, A, Prob, W[\pi])$.

Suppose that $\mu_0$ is not an optimal policy for $M[\pi]$. Let $\mu$ be an optimal policy for $M[\pi]$. Then
there exists some state s such that $\mu[s]$ is a strictly better policy than $\mu_0[s]$ for $M[\pi]$. Let $a = \mu[s]$ and
$a_0 = \mu_0[s]$. Let v = mdpVD($M[\pi]$, $\mu$). Since a is a strictly better policy than $a_0$ for state s in $M[\pi]$,
then, from the last iteration of Algorithm mdpPI($M[\pi]$), we have:

$W[\pi]_{a_0}(s) + \sum_{s' \in S} Prob(s, a_0, s') v[s'] > W[\pi]_a(s) + \sum_{s' \in S} Prob(s, a, s') v[s']$.

Moreover, since $a \neq \mu_0[s]$, Algorithm P-mdpPI(M, $\mu_0$) generates the following inequality in $K_0$:
$W_a(s) + \sum_{s' \in S} Prob(s, a, s') V[s'] \ge V[s]$. Since $V[s] = W_{a_0}(s) + \sum_{s' \in S} Prob(s, a_0, s') \times V[s']$ (from the call
to Algorithm P-mdpVD(M, $\mu_0$)), this inequality is equal to

$W_a(s) + \sum_{s' \in S} Prob(s, a, s') V[s'] \ge W_{a_0}(s) + \sum_{s' \in S} Prob(s, a_0, s') \times V[s']$.

Since $\pi \models K_0$, the instantiation of $K_0$, and in particular of this inequality, with $\pi$ should evaluate to true.
By Lemma 1, we have $V[\pi] = v$. Hence, by instantiating the inequality with $\pi$, we get:

$W[\pi]_a(s) + \sum_{s' \in S} Prob(s, a, s') v[s'] \ge W[\pi]_{a_0}(s) + \sum_{s' \in S} Prob(s, a_0, s') \times v[s']$,

which is exactly the contrary of what was stated before. □

3.5 Application to the Example 

Consider again the journey from Paris to Bologna described in Sect. 3.2. We give in Fig. 5 the PMDP
M adapted from Fig. 2 to the parametric case. The set of parameters is $P = \{p_1, p_2, p_3\}$. The reference
instantiation $\pi_0$ of the parameters is the following one^2:

$p_1 = 7 \qquad p_2 = 11 \qquad p_3 = 1$

Note that $M[\pi_0]$ corresponds to the (standard) MDP depicted in Fig. 2.

Figure 5: An example of Parametric Markov Decision Process

Let us briefly explain the application of P-mdpPI to this example. We first compute the optimal policy
$\mu_0$ for $M[\pi_0]$. As said in Sect. 3.2, $\mu_0 = \{P \to TGV, M \to Train\}$. Applying Algorithm P-mdpVD(M, $\mu_0$),
we then compute the parametric value of each state w.r.t. the optimal policy $\mu_0$. As B is an absorbing
state, we have V[B] = 0. Thus, we trivially have $V[M] = p_3$. We then have $V[P] = W_{\mu_0[P]}(P) + 1/5 \times
V[P] + 4/5 \times V[M]$, which gives $V[P] = 5/4 \times p_1 + p_3$. Note that, by replacing the parameters $p_i$ by
$\pi_0(p_i)$ in V[P] for i = 1, 2, 3, we get $5/4 \times 7 + 1 = 9.75$, which is equal to the value computed by the
classical algorithm mdpPI (from Lemma 1).

We now compute the constraint $K_0$. The only non-determinism being in state P, we generate the
following inequality: $1 \times (p_2 + V[B]) \ge V[P]$, which gives:

$p_2 \ge \frac{5}{4} p_1 + p_3$

By instantiating all the parameters except the one corresponding to the duration of the train between
Milan and Bologna (i.e., $p_3$), we get the following inequality:

$p_3 \le \frac{9}{4}$

Thus, if the train between Milan and Bologna takes more than 2h15 (i.e., is impacted by a delay of more
than 1h15), then the TGV policy will no longer be optimal, and we should consider
another option.
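
The arithmetic behind this bound can be checked with a few lines. This is only an illustrative sketch; the function name `tgv_stays_optimal` is an assumption, and the inequality encoded is the constraint p2 >= 5/4 p1 + p3 derived above.

```python
from fractions import Fraction as F

def tgv_stays_optimal(p1, p2, p3):
    """Constraint K0 of the example: p2 >= 5/4 * p1 + p3."""
    return p2 >= F(5, 4) * p1 + p3

# Fix p1 and p2 to their reference values and solve for the largest admissible p3.
p1, p2 = 7, 11
bound = p2 - F(5, 4) * p1
print(bound)  # 9/4, i.e. 2h15

print(tgv_stays_optimal(p1, p2, F(9, 4)))  # boundary case: both options take 11 hours
print(tgv_stays_optimal(p1, p2, F(5, 2)))  # a 2h30 trip from Milan breaks the constraint
```

At the boundary p3 = 9/4 the expected TGV journey takes exactly 11 hours, the same as the night train, which is why the inequality is non-strict.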



Remark. This example being simple, it was rather easy to predict this result from the direct application
of the classical algorithm mdpPI to the MDP described in Fig. 2. Indeed, the expected value v[P] in
state P is equal to 9.75 so, if a delay of more than 11 - 9.75 = 1.25 (i.e., 1h15) occurs somewhere
between Paris and Bologna using the TGV option (in particular between Milan and Bologna), the TGV
policy will not be optimal anymore. Our algorithm P-mdpPI is of course interesting for more complex
systems.

2 From the definition of the MDP and PMDP, the weight corresponding to leaving state P through action TGV must be the
same for any destination state. This is the reason why, in state P, the duration corresponding to waiting for the next train
(7 hours) is the same as the time needed to reach Milan. In the case where we would need different weights, it is possible to
set an average value for the weight by taking into account the respective probabilities.

3.6 Implementation

The algorithm P-mdpPI has been implemented under the form of a program named ImpRator (standing
for Inverse Method for Policy with Reward AbstracT behaviOR). This program, containing about
4300 lines of code, is written in Caml, and uses matrix inversion to compute the parametric value V in
Algorithm P-mdpVD. We applied our program to various examples of MDPs modeling devices. For a
system containing 11 states, 4 actions and 132 transitions, corresponding to the model of a robot evolving
in a bounded physical space [11], our program ImpRator generates a constraint in 0.17 s.
The program and various case studies can be downloaded on the ImpRator Web page^3.

4 Max-Plus Algebra 

We consider in this section the algorithm 4.4 "Max-Plus Policy Iteration" of [5] (which will be here 
denoted by maxPI), used to compute the maximal circuit mean of a weighted directed graph in the 
framework of max-plus algebra. We are interested in computing a constraint on the weights attached 
to a directed graph, such that the circuit of maximal mean remains the same, under any instantiation 
satisfying this constraint. 

We use in this section a formalism similar to the one in [5]^4.

4.1 Preliminaries 

The max-plus semiring $\mathbb{R}_{\max}$ is the set $\mathbb{R} \cup \{-\infty\}$, equipped with max and +. The zero element will be
denoted by $\varepsilon$ ($\varepsilon = -\infty$). The unit element will not be used in this paper.

Definition 4 A directed weighted graph (or DWG) G is a triple (S, E, w), where:

• S is a finite set of states,

• E is a set of oriented edges $E \subseteq S \times S$,

• $w : E \to \mathbb{R}$ is a function associating to every edge a real-valued weight.

We denote by w(e), or alternatively by $w_{ij}$, the weight associated to the edge e = (i, j). We associate to
G a matrix $M \in (\mathbb{R}_{\max})^{S \times S}$, such that

$M_{ij} = w_{ij}$ if $(i, j) \in E$, and $M_{ij} = \varepsilon$ otherwise.

Conversely, we associate to any matrix $M \in (\mathbb{R}_{\max})^{n \times n}$ the graph $G_M = (S, E, w)$, where $S = \{1, \dots, n\}$,
$E = \{(i, j) \in S \times S \mid M_{ij} \neq \varepsilon\}$, and $w_{ij} = M_{ij}$ for any $(i, j) \in S \times S$.

In the following, we will mainly consider the formalism of matrices rather than the graphs. We 
consider in the following the matrix M, whose associated graph Gm is strongly connected. 

3 http://www.lsv.ens-cachan.fr/~andre/ImPrator/

4 However, we will denote the policy by $\mu$ instead of $\pi$, both in order to keep the formalism introduced previously and in
order to avoid confusion with $\pi$, standing in our framework for an instantiation of the parameters.

Definition 5 Given a DWG G = (S, E, w), the maximal circuit mean is

$\rho = \max_c \frac{\sum_{e \in c} w(e)}{\sum_{e \in c} 1}$

where the max is taken over all the circuits c of G, and the sums are taken over all the edges e of c.

Note that, in the definition of $\rho$, the numerator is the weight of c, and the denominator is the length
of c.
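
To make the definition concrete, the maximal circuit mean of a small graph can be computed by brute-force enumeration of its elementary circuits. This naive enumeration is a swapped-in technique for illustration only and is not the policy-iteration algorithm of [5]; the function name and the three-state graph below are made up, and integer weights and comparable state names are assumed.

```python
from fractions import Fraction

def max_circuit_mean(w):
    """Maximal circuit mean of the graph whose edges (i, j) carry weights w[i, j].

    Elementary circuits are enumerated by depth-first search; each circuit is
    explored only from its smallest state, so it is counted once.
    """
    succ = {}
    for (i, j) in w:
        succ.setdefault(i, []).append(j)
    best = None

    def dfs(start, node, total, length, seen):
        nonlocal best
        for nxt in succ.get(node, []):
            t = total + w[node, nxt]
            if nxt == start:  # circuit closed: weight t, length + 1 edges
                mean = Fraction(t, length + 1)
                if best is None or mean > best:
                    best = mean
            elif nxt > start and nxt not in seen:
                dfs(start, nxt, t, length + 1, seen | {nxt})

    for s in succ:
        dfs(s, s, 0, 0, {s})
    return best  # None if the graph has no circuit

# Made-up graph: circuits 1->2->1 (mean 4), 1->2->3->1 (mean 2), 3->3 (mean 2).
w = {(1, 2): 3, (2, 1): 5, (2, 3): 1, (3, 1): 2, (3, 3): 2}
rho = max_circuit_mean(w)
print(rho)
```

The enumeration is exponential in the worst case, which is one reason why dedicated algorithms such as policy iteration are used in practice.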

In the context of DWGs, given a matrix M, a policy is a function $\mu$ from S to E, such that for all $i \in S$,
$\mu[i]$ is an edge starting from i. In the following, without loss of understanding, we will sometimes
abbreviate the edge $\mu[i] = (i, j)$ as its target state j. Given a policy $\mu$ for M, we denote by $\mu[i]$ the policy
associated to state i. Moreover, we denote by $M^\mu$ the matrix such that, for any $i, j \in S$, $M^\mu_{ij} = M_{ij}$ if $j = \mu[i]$,
and $M^\mu_{ij} = \varepsilon$ otherwise.

Given a matrix M and a policy $\mu$, the value function, denoted by $(\eta, x)$, associates to each state i of S
a couple $(\eta_i, x_i) \in \mathbb{R} \times \mathbb{R}$ (called "(generalized) eigenmode" in [5]).

An optimal policy $\mu$ for M induces a circuit c of maximal mean in graph G. More precisely, $\mu[i]$ is
an edge of c if i belongs to c, and there is a path from i to a state of c otherwise. Moreover, the associated
value $(\eta, x)$ is such that all the $\eta_i$s are identical, and equal to the maximal circuit mean $\rho$ of G.^5

The algorithm maxPI (see Fig. 15 in Appendix B) computes an optimal policy for a given DWG.
Starting from an arbitrary policy, it iteratively improves the current policy using Algorithm maxPImpr
(see Fig. 14 in Appendix B) and Algorithm maxVD (see Fig. 13 in Appendix B), which computes the
associated value function $(\eta, x)$.



4.2 An Illustrating Example 



We give in Fig. 6 an example of DWG (coming from [5]) with its corresponding matrix. We are interested
in finding the maximal circuit mean of a DWG. Let us briefly apply Algorithm maxPI to the matrix M
of Fig. 6. As in [5], we choose the initial policy $\pi_1 : 1 \to 1, i \to 2$, for i = 2, 3, 4. Applying Algorithm
maxVD, we find a first circuit $c_1 : 1 \to 1$, with $\overline{\eta} = w(c_1) = 1$. We set $\eta^1_1 = 1$, and $x^1_1 = 0$. Since 1 is the
only state which has access to 1, we apply algorithm maxVD to the subgraph of $G_M$ with states 2, 3, 4.
We find the circuit $c_2 : 2 \to 2$ and set $\overline{\eta} = w(c_2) = 3$, $\eta^1_2 = 3$, and $x^1_2 = 0$. Since 3, 4 have access to 2, we
set $\eta^1_i = 3$ for i = 3, 4. Moreover, an application of (7) yields $x^1_3 = 4 - 3 + x^1_2$, and $x^1_4 = 2 - 3 + x^1_2$. To
summarize:

$\eta^1 = (1, 3, 3, 3)^T, \qquad x^1 = (0, 0, 1, -1)^T$

5 Note that the $\eta_i$s are also equal to the (unique) eigenvalue of M, and x is an eigenvector of M (see Theorem 3.1 in [5]).
Figure 7: The graph corresponding to the matrix $M^\mu$

We improve the policy using Algorithm maxPImpr. Since $J = \{1\} \neq \emptyset$, we have a type 3a improvement.
This yields $\pi_2 : i \to 2$ for i = 2, 3, 4. Only the entry 1 of $x^1$ and $\eta^1$ has to be modified, which yields

$\eta^2 = (3, 3, 3, 3)^T, \qquad x^2 = (-1, 0, 1, -1)^T$

We tabulate with less details the end of the run of the algorithm. Algorithm maxPImpr, type 3b, policy
improvement. $\pi_3 : 1 \to 4, 2 \to 3, 3 \to 2, 4 \to 3$. Algorithm maxVD. Circuit found $c : 3 \to 2 \to 3$,
$\overline{\eta} = (w_{2,3} + w_{3,2})/2 = 9/2$.

$\eta^3 = (9/2, 9/2, 9/2, 9/2)^T, \qquad x^3 = (11/2, 0, -1/2, 3)^T$

Algorithm maxPImpr, type 3b, policy improvement. The only change is $\pi_4(3) = 4$. Algorithm maxVD.
Circuit found $c : 3 \to 4 \to 3$, $\overline{\eta} = (w_{3,4} + w_{4,3})/2 = 11/2$.

$\eta^4 = (11/2, 11/2, 11/2, 11/2)^T, \qquad x^4 = (4, -1/2, 0, 5/2)^T$



Algorithm maxPImpr. Stop. Hence, we get the following result:

$\eta = (11/2, 11/2, 11/2, 11/2)^T, \qquad x = (4, -1/2, 0, 5/2)^T, \qquad \mu = (4, 3, 4, 3)^T$    (1)

Thus, 11/2 is an eigenvalue of M, and x is an eigenvector. The subgraph of M restricted to the
policy $\mu$ is given in Fig. 7. We note that the mean of circuit $4 \to 3 \to 4$ is (8 + 3)/2 = 11/2, and it is
easy to check that this circuit has the maximal circuit mean of the graph associated to M.
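
The eigenvalue claim can be checked numerically with a max-plus matrix-vector product. In the sketch below, the weights $w_{1,1} = 1$, $w_{2,2} = 3$, $w_{2,3} = 5$, $w_{3,2} = 4$, $w_{3,4} = 3$, $w_{4,2} = 2$ and $w_{4,3} = 8$ are those quoted or implied by the run above; the entries $w_{1,2} = 2$ and $w_{1,4} = 7$, the choice of $\varepsilon$ for all remaining entries, and the normalisation $x_3 = 0$ of the eigenvector are assumptions, since Fig. 6 is not reproduced here. The function name `maxplus_matvec` is also an assumption.

```python
from fractions import Fraction as F

EPS = float("-inf")  # the zero element of the max-plus semiring

# Assumed matrix of the example (see the lead-in for which entries are assumptions).
M = [
    [1,   2,   EPS, 7],
    [EPS, 3,   5,   EPS],
    [EPS, 4,   EPS, 3],
    [EPS, 2,   8,   EPS],
]

def maxplus_matvec(M, x):
    """Max-plus product: (M x)_i = max_j (M_ij + x_j)."""
    return [max(m + xj for m, xj in zip(row, x)) for row in M]

eta = F(11, 2)
x = [F(4), F(-1, 2), F(0), F(5, 2)]  # assumed normalisation: x_3 = 0

# x is an eigenvector for eta iff (M x)_i = eta + x_i for every state i.
print(maxplus_matvec(M, x) == [eta + xi for xi in x])

# Mean of the circuit 4 -> 3 -> 4.
print(F(M[3][2] + M[2][3], 2) == eta)
```

Adding the same constant to every entry of x yields another eigenvector, so only the differences between entries are meaningful.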

We are now interested in the following problem. Suppose that one wants to minimize the weight
associated to the edge $4 \to 3$ (of weight $w_{4,3} = 8$). What is the minimal value for $w_{4,3}$ so that circuit
$4 \to 3 \to 4$ remains the circuit of maximal mean in the graph of M? In other words, we are interested in
computing a constraint on the weights of the system, viewed as parameters, so that the circuit of maximal
mean remains the circuit of maximal mean.

4.3 The Algorithm P-maxPI

We first adapt the notion of DWG to the parametric case. We now consider that the weights of the graph 
are parameters. 

Definition 6 Given a set P of parameters, a parametric directed weighted graph (or PDWG) G is a triple
(S, E, W), where:

• S is a finite set of states,

• E is a set of oriented edges $E \subseteq S \times S$,

• $W : E \to P$ is a parametric function associating to every edge a parametric weight.

We denote by W(e), or alternatively by $W_{ij}$, the parametric weight associated to the edge e = (i, j). We
associate to G a parametric matrix $M \in (P \cup \{\varepsilon\})^{S \times S}$, such that

$M_{ij} = W_{ij}$ if $(i, j) \in E$, and $M_{ij} = \varepsilon$ otherwise.

Conversely, we associate to any parametric matrix $M \in (P \cup \{\varepsilon\})^{n \times n}$ the graph $G_M = (S, E, W)$, where
$S = \{1, \dots, n\}$, $E = \{(i, j) \in S \times S \mid M_{ij} \neq \varepsilon\}$, and $W_{ij} = M_{ij}$ for any $(i, j) \in S \times S$.

We consider in the following the PDWG (S, E, W), and its associated matrix M. Given an instantiation
$\pi$ of the parameters, we denote by $W[\pi]$ the weight function from E to $\mathbb{R}$ obtained by replacing each
occurrence of a parameter $p_i$ in W with the value $\pi(p_i)$, for $1 \le i \le N$. Similarly, we denote by $M[\pi]$
the matrix (in $(\mathbb{R}_{\max})^{n \times n}$) obtained by replacing in M each occurrence of a parameter $p_i$ with the value
$\pi(p_i)$, for $1 \le i \le N$. The notion of policy can be extended to the parametric framework in a natural way.

Following the idea of our framework of Sect. 2, we first give in Fig. 8 the algorithm P-maxVD.
This algorithm is an adaptation to the parametric case of the algorithm for value determination maxVD
from [5] (see Fig. 13 in Appendix B). Given a policy $\mu$, it computes a parametric eigenmode (H, X) of
$M^\mu$. In other words, it associates to every state i of S two parametric values $H_i$ and $X_i$, which are two
linear terms (as defined in Def. 1).

We now introduce the algorithm P-maxPI, which fits in our general framework of Fig. 1. We give 
the algorithm P-maxPI in Fig. 9. As in the MDP Section, we first apply the standard algorithm for policy 
iteration from the literature, i.e., we first call Algorithm maxPI, given in Fig. 15 in Appendix B (which 
itself makes use of Algorithms maxVD and maxPImpr, available in Fig. 13 and Fig. 14 respectively). 
This algorithm computes the eigenmode $(\eta,x)$ of the maximal circuit mean of $M$, and the corresponding 
policy $\mu_0$. Then we compute the parametric eigenmode of $M$ associated to $\mu_0$, using Algorithm 
P-maxVD. Finally, we compute a set of inequalities ensuring that the policy $\mu_0$ is the optimal policy w.r.t. 
maximal circuit mean. This generation of inequalities is the adaptation to the parametric case of the test 
of optimality performed in the classical algorithm maxPImpr (given in Fig. 14 in Appendix B). 

We now state that our algorithm P-maxPI solves the inverse problem as described in Sect. 2.2. 

Theorem 2 Let $((\eta,x),\mu_0) = $ maxPI$(M[\pi_0])$ and $K_0 = $ P-maxPI$(M, ((\eta,x),\mu_0))$. Then: 

1. $\pi_0 \models K_0$, and 

2. for all $\pi \models K_0$, policy $\mu_0$ corresponds to a maximal mean circuit of $M[\pi]$. 

Note that, although we guarantee that the circuit of maximal mean in $M[\pi]$ is always the same, for 
any $\pi \models K_0$, the mean value itself varies with $\pi$. 
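The remark above can be checked numerically on a small graph. The following Python sketch is a brute-force search over elementary circuits, not the policy-iteration algorithm of the paper. The edge set is the one appearing in Fig. 10, but the weight values are hypothetical (they are not those of Fig. 6), and the function name `max_mean_circuit` is an assumption.

```python
from fractions import Fraction
from itertools import permutations

def max_mean_circuit(weights):
    """Return (mean, circuit) for a circuit of maximal mean, by exhaustive search.

    weights maps an edge (i, j) to an integer weight. Only suitable for tiny graphs.
    """
    states = sorted({s for edge in weights for s in edge})
    best = None
    for k in range(1, len(states) + 1):
        for cyc in permutations(states, k):
            if cyc[0] != min(cyc):  # keep a single rotation of each circuit
                continue
            edges = list(zip(cyc, cyc[1:] + cyc[:1]))
            if all(e in weights for e in edges):
                mean = Fraction(sum(weights[e] for e in edges), k)
                if best is None or mean > best[0]:
                    best = (mean, cyc)
    return best

# Hypothetical reference weights (NOT the values of the paper's Fig. 6).
ref = {(1, 1): 1, (1, 2): 2, (1, 4): 4, (2, 2): 1, (2, 3): 4,
       (3, 2): 5, (3, 4): 3, (4, 2): 2, (4, 3): 7}
other = dict(ref)
other[(4, 3)] = 9  # a second instantiation: only W_{4,3} changes

best_ref = max_mean_circuit(ref)
best_other = max_mean_circuit(other)
```

Under these assumed weights, both instantiations select the circuit through states 3 and 4, while the mean itself differs between the two.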





ALGORITHM P-maxVD($M$, $\mu$) 

Input   $M$ : Matrix 
        $\mu$ : Policy 
Output  $(H,X)$ : Parametric eigenmode of $M^\mu$ 

1. Find a circuit $c$ in the graph of $M^\mu$. 

2. Set 
   $$H := \frac{\sum_{e \in c} W(e)}{\sum_{e \in c} 1}$$

3. Select an arbitrary state $i$ in $c$, set $H_i := H$, and set $X_i$ to an arbitrary value, say $X_i := 0$. 

4. Visiting all the states $j$ that have access to $i$ in backward topological order, set 
   $$H_j := H \qquad (2)$$
   $$X_j := W_{j,\mu(j)} - H + X_{\mu(j)} \qquad (3)$$

5. If there is a nonempty set $C$ of states $j$ that do not have access to $i$, repeat the algorithm 
   using the $C \times C$ submatrix of $M$ and the restriction of $\mu$ to $C$. 


Figure 8: Algorithm for parametric value determination for maximal circuit mean 
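The steps of Fig. 8 can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the authors' prototype: linear terms are represented as dictionaries from parameter names to exact coefficients, a policy is a dictionary from each state to its successor, and the helper `add`, the function name `p_max_vd` and the parameter naming scheme `W34` are illustrative choices. The state $i$ of step 3 is taken to be the smallest state of the circuit, which is one arbitrary choice among others.

```python
from fractions import Fraction

def add(a, b, k=1):
    """Return the linear term a + k*b (terms are dicts: parameter name -> coefficient)."""
    out = dict(a)
    for p, c in b.items():
        out[p] = out.get(p, 0) + k * c
    return {p: c for p, c in out.items() if c != 0}

def p_max_vd(mu, W):
    """Parametric value determination for a policy mu (dict: state -> successor).

    W maps each edge (j, mu[j]) to a linear term. Returns (H, X), two dicts of linear terms.
    """
    H, X = {}, {}
    remaining = sorted(mu)
    while remaining:
        # Step 1: follow mu from a remaining state until a state repeats.
        path, s = [], remaining[0]
        while s not in path:
            path.append(s)
            s = mu[s]
        circuit = path[path.index(s):]
        # Step 2: H is the mean of the parametric weights along the circuit.
        total = {}
        for j in circuit:
            total = add(total, W[j, mu[j]])
        h = add({}, total, Fraction(1, len(circuit)))
        # Step 3: pick a state i of the circuit (the smallest one, an arbitrary choice).
        i = min(circuit)
        H[i], X[i] = h, {}
        # Step 4: propagate backwards to every state that has access to i.
        changed = True
        while changed:
            changed = False
            for j in remaining:
                if j not in X and mu[j] in X:
                    H[j] = h
                    X[j] = add(add(W[j, mu[j]], h, -1), X[mu[j]])
                    changed = True
        # Step 5: repeat on the states that do not have access to i.
        remaining = [j for j in remaining if j not in X]
    return H, X

# Policy whose circuit is 3 -> 4 -> 3, each policy edge carrying its own parameter.
mu0 = {1: 4, 2: 3, 3: 4, 4: 3}
W = {(j, k): {f"W{j}{k}": Fraction(1)} for j, k in mu0.items()}
H, X = p_max_vd(mu0, W)
```

With this policy, `X[1]` simplifies to the term $W_{1,4} - W_{3,4}$, since the contributions of $W_{4,3}$ cancel out.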



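ALGORITHM P-maxPI($M$, $((\eta,x),\mu_0)$) 

Input      $M$ : Matrix 
           $((\eta,x),\mu_0)$ : Eigenmode and policy optimal for the reference instantiation 
Output     $K_0$ : Constraint on the parameters 
Variables  $(H,X)$ : Parametric eigenmode of $M$ 

$(H,X)$ := P-maxVD($M$, $\mu_0$) 
$K_0$ := True 

FOR EACH $i,j$ s.t. $M_{i,j} \neq \varepsilon$ DO 

   $$K_0 := \begin{cases} K_0 \wedge \{H_j > H_i\} & \text{if } \eta_j > \eta_i \\ K_0 \wedge \{H_j \le H_i\} & \text{if } \eta_j \le \eta_i \end{cases} \qquad (4)$$

   $$K_0 := \begin{cases} K_0 \wedge \{(W_{i,j} - H_j + X_j) > X_i\} & \text{if } \eta_j \le \eta_i \wedge (w_{i,j} - \eta_j + x_j) > x_i \\ K_0 \wedge \{(W_{i,j} - H_j + X_j) \le X_i\} & \text{if } \eta_j \le \eta_i \wedge (w_{i,j} - \eta_j + x_j) \le x_i \end{cases} \qquad (5)$$

OD 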







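Figure 9: Algorithm solving the maximal circuit mean inverse problem 

$1 \to 1$ 
   $\tfrac{1}{2}W_{3,4} + \tfrac{1}{2}W_{4,3} \le \tfrac{1}{2}W_{3,4} + \tfrac{1}{2}W_{4,3}$ 
   $W_{1,1} - \tfrac{1}{2}W_{3,4} - \tfrac{1}{2}W_{4,3} + W_{1,4} - \tfrac{1}{2}W_{3,4} - \tfrac{1}{2}W_{4,3} + W_{4,3} - \tfrac{1}{2}W_{3,4} - \tfrac{1}{2}W_{4,3} \le W_{1,4} - \tfrac{1}{2}W_{3,4} - \tfrac{1}{2}W_{4,3} + W_{4,3} - \tfrac{1}{2}W_{3,4} - \tfrac{1}{2}W_{4,3}$ 

$1 \to 2$ 
   $\tfrac{1}{2}W_{3,4} + \tfrac{1}{2}W_{4,3} \le \tfrac{1}{2}W_{3,4} + \tfrac{1}{2}W_{4,3}$ 
   $W_{1,2} - \tfrac{1}{2}W_{3,4} - \tfrac{1}{2}W_{4,3} + W_{2,3} - \tfrac{1}{2}W_{3,4} - \tfrac{1}{2}W_{4,3} \le W_{1,4} - \tfrac{1}{2}W_{3,4} - \tfrac{1}{2}W_{4,3} + W_{4,3} - \tfrac{1}{2}W_{3,4} - \tfrac{1}{2}W_{4,3}$ 

$1 \to 4$ 
   $\tfrac{1}{2}W_{3,4} + \tfrac{1}{2}W_{4,3} \le \tfrac{1}{2}W_{3,4} + \tfrac{1}{2}W_{4,3}$ 
   $W_{1,4} - \tfrac{1}{2}W_{3,4} - \tfrac{1}{2}W_{4,3} + W_{4,3} - \tfrac{1}{2}W_{3,4} - \tfrac{1}{2}W_{4,3} \le W_{1,4} - \tfrac{1}{2}W_{3,4} - \tfrac{1}{2}W_{4,3} + W_{4,3} - \tfrac{1}{2}W_{3,4} - \tfrac{1}{2}W_{4,3}$ 

$2 \to 2$ 
   $\tfrac{1}{2}W_{3,4} + \tfrac{1}{2}W_{4,3} \le \tfrac{1}{2}W_{3,4} + \tfrac{1}{2}W_{4,3}$ 
   $W_{2,2} - \tfrac{1}{2}W_{3,4} - \tfrac{1}{2}W_{4,3} + W_{2,3} - \tfrac{1}{2}W_{3,4} - \tfrac{1}{2}W_{4,3} \le W_{2,3} - \tfrac{1}{2}W_{3,4} - \tfrac{1}{2}W_{4,3}$ 

$2 \to 3$ 
   $\tfrac{1}{2}W_{3,4} + \tfrac{1}{2}W_{4,3} \le \tfrac{1}{2}W_{3,4} + \tfrac{1}{2}W_{4,3}$ 
   $W_{2,3} - \tfrac{1}{2}W_{3,4} - \tfrac{1}{2}W_{4,3} + 0 \le W_{2,3} - \tfrac{1}{2}W_{3,4} - \tfrac{1}{2}W_{4,3}$ 

$3 \to 2$ 
   $\tfrac{1}{2}W_{3,4} + \tfrac{1}{2}W_{4,3} \le \tfrac{1}{2}W_{3,4} + \tfrac{1}{2}W_{4,3}$ 
   $W_{3,2} - \tfrac{1}{2}W_{3,4} - \tfrac{1}{2}W_{4,3} + W_{2,3} - \tfrac{1}{2}W_{3,4} - \tfrac{1}{2}W_{4,3} \le 0$ 

$3 \to 4$ 
   $\tfrac{1}{2}W_{3,4} + \tfrac{1}{2}W_{4,3} \le \tfrac{1}{2}W_{3,4} + \tfrac{1}{2}W_{4,3}$ 
   $W_{3,4} - \tfrac{1}{2}W_{3,4} - \tfrac{1}{2}W_{4,3} + W_{4,3} - \tfrac{1}{2}W_{3,4} - \tfrac{1}{2}W_{4,3} \le 0$ 

$4 \to 2$ 
   $\tfrac{1}{2}W_{3,4} + \tfrac{1}{2}W_{4,3} \le \tfrac{1}{2}W_{3,4} + \tfrac{1}{2}W_{4,3}$ 
   $W_{4,2} - \tfrac{1}{2}W_{3,4} - \tfrac{1}{2}W_{4,3} + W_{2,3} - \tfrac{1}{2}W_{3,4} - \tfrac{1}{2}W_{4,3} \le W_{4,3} - \tfrac{1}{2}W_{3,4} - \tfrac{1}{2}W_{4,3}$ 

$4 \to 3$ 
   $\tfrac{1}{2}W_{3,4} + \tfrac{1}{2}W_{4,3} \le \tfrac{1}{2}W_{3,4} + \tfrac{1}{2}W_{4,3}$ 
   $W_{4,3} - \tfrac{1}{2}W_{3,4} - \tfrac{1}{2}W_{4,3} + 0 \le W_{4,3} - \tfrac{1}{2}W_{3,4} - \tfrac{1}{2}W_{4,3}$ 

Figure 10: The generation of $K_0$ for our example of graph 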
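The generation of these inequalities can be sketched in Python. The sketch below is a hedged illustration, not the authors' implementation. The parametric eigenmode $(H,X)$ is hard-coded from the values derived for this example, the reference eigenmode $(\eta,x)$ is obtained by evaluating $(H,X)$ at a reference instantiation `pi0` (an assumption standing in for the output of maxPI), and the values in `pi0` are hypothetical rather than those of Fig. 6. Each constraint is stored as a pair `(term, op)` meaning `term op 0`.

```python
from fractions import Fraction as F

def add(a, b, k=1):
    """Return the linear term a + k*b (terms are dicts: parameter name -> coefficient)."""
    out = dict(a)
    for p, c in b.items():
        out[p] = out.get(p, 0) + k * c
    return {p: c for p, c in out.items() if c != 0}

def value(term, pi):
    """Evaluate a linear term under the instantiation pi."""
    return sum(c * pi[p] for p, c in term.items())

def p_max_pi_constraints(edges, H, X, pi0):
    """Generate the inequalities (4) and (5) for every edge, as pairs (term, op)."""
    K0 = []
    for (i, j), w in edges.items():
        # (4): H_j compared with H_i in the same way as eta_j with eta_i.
        d = add(H[j], H[i], -1)
        K0.append((d, ">" if value(d, pi0) > 0 else "<="))
        # (5): generated only when eta_j <= eta_i.
        if value(d, pi0) <= 0:
            t = add(add(add(w, H[j], -1), X[j]), X[i], -1)
            K0.append((t, ">" if value(t, pi0) > 0 else "<="))
    return K0

h = {"W34": F(1, 2), "W43": F(1, 2)}
H = {s: h for s in (1, 2, 3, 4)}
X = {1: {"W14": F(1), "W34": F(-1)},
     2: {"W23": F(1), "W34": F(-1, 2), "W43": F(-1, 2)},
     3: {},
     4: {"W43": F(1, 2), "W34": F(-1, 2)}}
edge_list = [(1, 1), (1, 2), (1, 4), (2, 2), (2, 3), (3, 2), (3, 4), (4, 2), (4, 3)]
edges = {(i, j): {f"W{i}{j}": F(1)} for i, j in edge_list}
# Hypothetical reference instantiation (NOT the weights of Fig. 6).
pi0 = {"W11": 1, "W12": 2, "W14": 4, "W22": 1, "W23": 4,
       "W32": 5, "W34": 3, "W42": 2, "W43": 7}

K0 = p_max_pi_constraints(edges, H, X, pi0)
nontrivial = [(t, op) for t, op in K0 if t]  # drop constraints of the form 0 <= 0
```

Because terms are simplified as they are built, the trivial inequalities collapse to the empty term, and `nontrivial` keeps one inequality per edge that does not belong to the policy circuit or to a policy path.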



4.4 Application to the Example 

Let us apply the algorithm P-maxPI given in Fig. 9 to the graph from [5] depicted in Fig. 6 in Sect. 4.2. 
We first apply Algorithm maxPI, which gives the result (1) of Sect. 4.2. 

Then, we call Algorithm P-maxVD. The circuit $c$ found is $3 \to 4 \to 3$. We set $H = (W_{3,4} + W_{4,3})/2$. 
We then pick, say, state 3 in $c$, set $H_3 := H$ and $X_3 := 0$. Then, visiting all the states $j$ that have access 
to $i$ in backward topological order, we have: 

• For state 1: $H_1 = H$, and $X_1 = W_{1,4} - H + X_4$ 

• For state 2: $H_2 = H$, and $X_2 = W_{2,3} - H + X_3$ 

• For state 4: $H_4 = H$, and $X_4 = W_{4,3} - H + X_3$ 

Since the set $C$ of states $j$ that do not have access to $i$ is empty, the algorithm P-maxVD terminates. After 
resolution of the system above, we get 

$$H = \begin{pmatrix} \frac{W_{3,4}+W_{4,3}}{2} \\ \frac{W_{3,4}+W_{4,3}}{2} \\ \frac{W_{3,4}+W_{4,3}}{2} \\ \frac{W_{3,4}+W_{4,3}}{2} \end{pmatrix} \qquad X = \begin{pmatrix} W_{1,4} - W_{3,4} \\ W_{2,3} - \tfrac{1}{2}W_{3,4} - \tfrac{1}{2}W_{4,3} \\ 0 \\ \tfrac{1}{2}W_{4,3} - \tfrac{1}{2}W_{3,4} \end{pmatrix}$$

We now generate the inequalities. For every edge $(i,j)$ of the graph, we generate two inequalities, 
i.e., inequalities (4) and (5) of the algorithm P-maxPI. All generated inequalities, including the trivial 
ones (i.e., of the form $a \le a$, for some linear term $a$), are depicted in Fig. 10. For every edge of 
the graph, we first give the inequality corresponding to (4), and then the inequality corresponding to (5). 
The conjunction of those inequalities gives the constraint $K_0$ output by the algorithm. 

After simplification (trivially done by hand) of the constraint $K_0$, we get the following constraint: 

$$\begin{array}{ll} & 2W_{1,1} \le W_{3,4} + W_{4,3} \\ \wedge & W_{1,2} + W_{2,3} \le W_{1,4} + W_{4,3} \\ \wedge & 2W_{2,2} \le W_{3,4} + W_{4,3} \\ \wedge & W_{2,3} + W_{3,2} \le W_{3,4} + W_{4,3} \\ \wedge & 2W_{2,3} + 2W_{4,2} \le W_{3,4} + 3W_{4,3} \end{array}$$

Recall that we were interested in Sect. 4.2 in knowing how far $W_{4,3}$ could be decreased 
while the circuit $4 \to 3 \to 4$ remained the circuit of maximal mean in the graph $M$ of Fig. 6. Let us 
instantiate all parameters except $W_{4,3}$ in the constraint output by P-maxPI($M$, $((\eta,x),\mu_0)$). We then get 
the following inequality: 

$W_{4,3} \ge 6$ 

Thus, provided this inequality holds, the circuit $4 \to 3 \to 4$ remains the maximal mean circuit in the 
graph $M$ of Fig. 6. Note that it is actually easy to see on the graph in Fig. 6 that, if $W_{4,3} < 6$, the maximal 
mean circuit then becomes $2 \to 3 \to 2$, with $\eta = 9/2$. 
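The instantiation step can be sketched in Python as follows. The five inequalities are those of the simplified constraint, each written as a linear term that must be at most 0. The fixed values in `pi` are hypothetical: they were chosen only so that the sketch reproduces the bound 6 and are not claimed to be the weights of Fig. 6. The function name `lower_bound` is likewise an assumption.

```python
from fractions import Fraction

# Simplified constraint: each dict is a linear term required to be <= 0.
K0 = [
    {"W11": 2, "W34": -1, "W43": -1},
    {"W12": 1, "W23": 1, "W14": -1, "W43": -1},
    {"W22": 2, "W34": -1, "W43": -1},
    {"W23": 1, "W32": 1, "W34": -1, "W43": -1},
    {"W23": 2, "W42": 2, "W34": -1, "W43": -3},
]

def lower_bound(constraints, free, pi):
    """Instantiate every parameter except `free` and return the resulting lower bound.

    Assumes `free` has a negative coefficient in every term, as is the case here.
    """
    bounds = []
    for term in constraints:
        c = term[free]
        assert c < 0, "sketch only handles constraints that bound `free` from below"
        rest = sum(k * pi[p] for p, k in term.items() if p != free)
        # c*free + rest <= 0  is equivalent to  free >= rest / (-c)  when c < 0.
        bounds.append(Fraction(rest, -c))
    return max(bounds)

# Hypothetical values for the instantiated parameters (NOT taken from Fig. 6).
pi = {"W11": 1, "W12": 2, "W14": 4, "W22": 1, "W23": 4,
      "W32": 5, "W34": 3, "W42": 2}

bound = lower_bound(K0, "W43", pi)
```

With these assumed values the binding inequality is the one comparing the circuits $2 \to 3 \to 2$ and $3 \to 4 \to 3$, which is consistent with the remark that the former becomes the maximal mean circuit below the bound.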

5 Final Remarks 

We have presented an extension of two algorithms based on policy iteration for two models: Markov 
Decision Problems and Max-Plus Algebras. For these models, we introduced a natural generalization of the 
policy-iteration method that solves the inverse problem, i.e., considering the weights of the models to be 
unknown constants or parameters, and given a reference instantiation $\pi_0$ of those weight parameters, we 
compute a constraint under which an optimal policy for $\pi_0$ is still optimal. This increases our confidence 
in the robustness of policy-iteration based methods. 

This inverse method has also been applied to another kind of weighted graphs, i.e., directed weighted 
graphs: in this context, we generate a constraint on the weights seen as parameters, guaranteeing that the 
shortest path from one state to another one remains the shortest path [2]. 

Such an extension seems to work on several other policy-iteration algorithms. In particular, we are 
studying the adaptation of the method to Markov decision processes with two weights, as used in the 
problem of dynamic power management [10] for real-time systems, where one wants to minimize the 
power consumption while keeping a certain level of efficiency. We also plan to adapt the method to an 
extension of Algorithm maxPI that handles deterministic games with mean payoff [6]. 

Acknowledgments. We thank an anonymous referee for his/her helpful comments. 

A Markov Decision Processes Algorithms 

ALGORITHM mdpVD($M$, $\mu$) 

Input   $M$ : Markov Decision Process $(S, A, Prob, w)$ 
        $\mu$ : Policy 
Output  $v$ : Value function 

SOLVE $\{\, v[s] = w_{\mu[s]}(s) + \sum_{s' \in S} Prob(s, \mu[s], s') \times v[s'] \,\}_{s \in S \setminus \{s_n\}}$ 

Figure 11: Algorithm for value determination for MDPs 



ALGORITHM mdpPI($M$) 

Input   $M$ : Markov Decision Process $(S, A, Prob, w)$ 
Output  $\mu$ : Policy optimal w.r.t. $w$ (initially random) 
        $v$ : Value function 

REPEAT 
    $v$ := mdpVD($M$, $\mu$) 
    fixpoint := True 
    FOR EACH $s \in S \setminus \{s_n\}$ DO 
        optimum := $v[s]$ 
        FOR EACH $a \in e(s)$ DO 
            IF $w_a(s) + \sum_{s' \in S} Prob(s,a,s')\, v[s'] <$ optimum THEN 
                optimum := $w_a(s) + \sum_{s' \in S} Prob(s,a,s')\, v[s']$ 
                $\mu[s]$ := $a$ 
                fixpoint := False 
            FI 
        OD 
    OD 
UNTIL fixpoint 



Figure 12: Algorithm of policy iteration for MDPs 
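Figures 11 and 12 can be sketched together in Python. This is a minimal sketch under several assumptions that the figures do not state: the final state has value 0, every policy reaches the final state with probability 1 (so that the linear system has a unique solution), and weights are costs to be minimized, as suggested by the test in Fig. 12. The tiny MDP, its state and action names, and the helper `solve` are all hypothetical.

```python
from fractions import Fraction as F

def solve(A, b):
    """Gauss-Jordan elimination on exact fractions (A is assumed nonsingular)."""
    n = len(A)
    rows = [list(map(F, A[r])) + [F(b[r])] for r in range(n)]
    for c in range(n):
        p = next(r for r in range(c, n) if rows[r][c] != 0)
        rows[c], rows[p] = rows[p], rows[c]
        rows[c] = [v / rows[c][c] for v in rows[c]]
        for r in range(n):
            if r != c and rows[r][c] != 0:
                f = rows[r][c]
                rows[r] = [a - f * d for a, d in zip(rows[r], rows[c])]
    return [rows[r][n] for r in range(n)]

def mdp_vd(states, final, prob, w, mu):
    """Value determination: solve v[s] = w(s, mu[s]) + sum_s' Prob * v[s'] on non-final states."""
    A = [[F(int(s == t)) - prob[s, mu[s]].get(t, 0) for t in states] for s in states]
    b = [w[s, mu[s]] for s in states]
    v = dict(zip(states, solve(A, b)))
    v[final] = F(0)  # assumption: the final state has value 0
    return v

def mdp_pi(states, final, actions, prob, w, mu):
    """Policy iteration: improve mu until no action strictly lowers the value."""
    while True:
        v = mdp_vd(states, final, prob, w, mu)
        fixpoint = True
        for s in states:
            optimum = v[s]
            for a in actions[s]:
                q = w[s, a] + sum(p * v[t] for t, p in prob[s, a].items())
                if q < optimum:
                    optimum, mu[s], fixpoint = q, a, False
        if fixpoint:
            return mu, v

# Hypothetical MDP with non-final states A, B and final state "end".
states, final = ["A", "B"], "end"
actions = {"A": ["slow", "fast"], "B": ["go"]}
prob = {("A", "slow"): {"A": F(1, 2), "end": F(1, 2)},
        ("A", "fast"): {"end": F(1)},
        ("B", "go"): {"A": F(1)}}
w = {("A", "slow"): F(1), ("A", "fast"): F(3), ("B", "go"): F(1)}

mu, v = mdp_pi(states, final, actions, prob, w, {"A": "fast", "B": "go"})
```

Exact fractions are used so that the strict comparison in the improvement test is not affected by rounding.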
B Maximal Circuit Mean Algorithms 

ALGORITHM maxVD($M$, $\mu$) 

Input   $M$ : Matrix 
        $\mu$ : Policy 
Output  $(\eta,x)$ : Eigenmode of $M^\mu$ 

1. Find a circuit $c$ in the graph of $M^\mu$. 

2. Set 
   $$\eta := \frac{\sum_{e \in c} w(e)}{\sum_{e \in c} 1}$$

3. Select an arbitrary state $i$ in $c$, set $\eta_i := \eta$, and set $x_i$ to an arbitrary value, say $x_i := 0$. 

4. Visiting all the states $j$ that have access to $i$ in backward topological order, set 
   $$\eta_j := \eta \qquad (6)$$
   $$x_j := w_{j,\mu(j)} - \eta + x_{\mu(j)} \qquad (7)$$

5. If there is a nonempty set $C$ of states $j$ that do not have access to $i$, repeat the algorithm 
   using the $C \times C$ submatrix of $M$ and the restriction of $\mu$ to $C$. 



Figure 13: Algorithm of value determination for maximal circuit mean in max-plus algebras 

ALGORITHM maxPImpr($M$, $\mu$, $(\eta,x)$) 

Input   $M$ : Matrix 
        $\mu$ : Former policy 
        $(\eta,x)$ : Eigenmode of $M^\mu$ 
Output  $\mu'$ : New policy 

1. Let (*) 
   $$J = \{\, i \mid \max_{(i,j) \in E} \eta_j > \eta_i \,\}$$
   $$K(i) := \operatorname{argmax}_{(i,j) \in E} \eta_j, \text{ for } i = 1, \dots, n,$$
   $$I = \{\, i \mid \max_{e=(i,j) \in K(i)} (w(e) - \eta_j + x_j) > x_i \,\}$$
   $$L(i) := \operatorname{argmax}_{e=(i,j) \in K(i)} (w(e) - \eta_j + x_j), \text{ for } i = 1, \dots, n.$$

2. If $I = J = \emptyset$, $(\eta,x)$ is an eigenmode of $M$. 

3. (a) If $J \neq \emptyset$, we set: 
   $$\mu'(i) := \begin{cases} \text{any } e \in K(i) & \text{if } i \in J \\ \mu(i) & \text{if } i \notin J \end{cases}$$

   (b) If $J = \emptyset$ but $I \neq \emptyset$, we set: 
   $$\mu'(i) := \begin{cases} \text{any } e \in L(i) & \text{if } i \in I \\ \mu(i) & \text{if } i \notin I \end{cases}$$

(*) Recall that by $\operatorname{argmax}_{e \in E} f(e)$, we mean as usual the set of elements $m \in E$ such that $f(m) = \max_{e \in E} f(e)$. 



Figure 14: Algorithm of policy improvement for maximal circuit mean in max-plus algebras 
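The improvement step of Fig. 14 can be sketched in Python. The sketch assumes that every state has at least one outgoing edge and identifies an edge $(i,j)$ chosen by a policy with its target $j$. The function name `max_p_impr`, the weights and the two eigenmodes are hypothetical values used for illustration only; the eigenmodes were computed by hand for these assumed weights following the steps of Fig. 13.

```python
from fractions import Fraction as F

def max_p_impr(w, mu, eta, x):
    """One policy-improvement step; policies map each state to a successor state."""
    states = sorted(mu)
    succ = {i: [j for (a, j) in w if a == i] for i in states}

    def gain(i, j):
        return w[i, j] - eta[j] + x[j]

    # K(i): successors with maximal eta; L(i): those of K(i) with maximal gain.
    K = {i: [j for j in succ[i] if eta[j] == max(eta[t] for t in succ[i])]
         for i in states}
    L = {i: [j for j in K[i] if gain(i, j) == max(gain(i, t) for t in K[i])]
         for i in states}
    J = [i for i in states if max(eta[j] for j in succ[i]) > eta[i]]
    I = [i for i in states if max(gain(i, j) for j in K[i]) > x[i]]
    if J:  # case 3(a)
        return {i: K[i][0] if i in J else mu[i] for i in states}
    if I:  # case 3(b)
        return {i: L[i][0] if i in I else mu[i] for i in states}
    return dict(mu)  # I = J = empty: nothing to improve

# Hypothetical weights (NOT those of the paper's Fig. 6).
w = {(1, 1): 1, (1, 2): 2, (1, 4): 4, (2, 2): 1, (2, 3): 4,
     (3, 2): 5, (3, 4): 3, (4, 2): 2, (4, 3): 7}

# Eigenmode of a policy whose circuit is 2 -> 3 -> 2 (mean 9/2).
mu = {1: 4, 2: 3, 3: 2, 4: 3}
eta = {s: F(9, 2) for s in mu}
x = {1: F(5, 2), 2: F(0), 3: F(1, 2), 4: F(3)}
mu1 = max_p_impr(w, mu, eta, x)

# Eigenmode of the improved policy, whose circuit is 3 -> 4 -> 3 (mean 5).
eta1 = {s: F(5) for s in mu1}
x1 = {1: F(1), 2: F(-1), 3: F(0), 4: F(2)}
mu2 = max_p_impr(w, mu1, eta1, x1)
```

In this example only state 3 changes its choice in the first call, and the second call leaves the policy unchanged, which corresponds to the stopping condition of step 2.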



ALGORITHM maxPI($M$) 

Input      $M$ : Matrix 
Output     $\mu$ : Optimal policy (initially arbitrary) 
Variables  $(\eta,x)$ : Eigenmode of $M^\mu$ 
           $\mu'$ : Former policy 

DO 
    $(\eta,x)$ := maxVD($M$, $\mu$) 
    $\mu'$ := $\mu$ 
    $\mu$ := maxPImpr($M$, $\mu$, $(\eta,x)$) 
UNTIL $\mu = \mu'$ 



Figure 15: Algorithm of policy iteration for maximal circuit mean in max-plus algebras 



